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Abstract. Th<0 method of stable random projections is a tool for effi- 
ciently computing the l a distances using low memory, where < a < 2 
is a tuning parameter. The method boils down to a statistical estimation 
task and various estimators have been proposed, based on the geometric 
mean, the harmonic mean, and the fractional power etc. 
This study proposes the optimal quantile estimator, whose main op- 
eration is selecting, which is considerably less expensive than taking 
fractional power, the main operation in previous estimators. Our exper- 
iments report that the optimal quantile estimator is nearly one order 
of magnitude more computationally efficient than previous estimators. 
For large-scale learning tasks in which storing and computing pairwise 
distances is a serious bottleneck, this estimator should be desirable. 
In addition to its computational advantages, the optimal quantile estima- 
tor exhibits nice theoretical properties. It is more accurate than previous 
estimators when a > 1. We derive its theoretical error bounds and es- 
tablish the explicit (i.e., no hidden constants) sample complexity bound. 

1 Introduction 

The method of stable random vroiections \ll2l'3\ , as an efficient tool for comput- 
ing pairwise distances in massive high-dimensional data, provides a promising 
mechanism to tackle some of the challenges in modern machine learning. In this 
paper, we provide an easy-to-implement algorithm for stable random projections 
which is both statistically accurate and computationally efficient. 

1.1 Massive High-dimensional Data in Modern Machine Learning 

We denote a data matrix by A S R™ x£) , i.e., n data points in D dimensions. 
Data sets in modern applications exhibit important characteristics which impose 
tremendous challenges in machine learning [3]: 

- Modern data sets with n = 10 5 or even n = 10 6 points are not uncommon in 
supervised learning, e.g., in image/text classification, ranking algorithms for 
search engines, etc. In the unsupervised domain (e.g., Web clustering, ads 
clickthroughs, word/term associations), n can be even much larger. 

- Modern data sets are often of ultra high-dimensions (D), sometimes in the 
order of millions (or even higher), e.g., image, text, genome (e.g., SNP), etc. 
For example, in image analysis, D may be 10 3 x 10 3 = 10 6 if using pixels as 
features, or D = 256 3 « 16 million if using color histograms as features. 

- Modern data sets are sometimes collected in a dynamic streaming fashion. 

1 First draft Feb. 2008, slightly revised in June 2008. The results were announced in 
January 2008 at SODA'08 when the author presented the work of [2]. 
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- Large-scale data are often heavy-tailed, e.g., image and text data. 

Some large-scale data are dense, such as image and genome data. Even for 
data sets which are sparse, such as text, the absolute number of non-zeros may 
be still large. For example, if one queries "machine learning" (a not-too-common 
term) in Google.com, the total number of pagehits is about 3 million. In other 
words, if one builds a term-doc matrix at Web scale, although the matrix is 
sparse, most rows will contain large numbers (e.g., millions) of non-zero entries. 

1.2 Pairwise Distances in Machine Learning 

Many learning algorithms require a similarity matrix computed from pairwise 
distances of the data matrix A £ R TlXjD . Examples include clustering, nearest 
neighbors, multidimensional scaling, and kernel SVM (support vector machines). 
The similarity matrix requires 0{n 2 ) storage space and 0(n 2 D) computing time. 

This study focuses on the l a distance (0 < a < 2). Consider two vectors u%, 
U2 6 M. D (e.g., the leading two rows in A), the l a distance between u\ and ui is 

D 

d( a ) =^2\u!a — U-2,i\ a ■ (1) 

Note that, strictly speaking, the l a distance should be defined as dV?. Be- 
cause the power operation (.) 1/a is the same for all pairs, it often makes no 
difference whether we use d 1 ^ or just d( Q ); and hence we focus on <i( a ). 

The radial basis kernel (e.g., for SVM) is constructed from d^ a ) |5l6j : 

K(wi,u 2 ) =exp -ui,i| a V 0<a<2. (2) 

When a — 2, this is the Gaussian radial basis kernel. Here a can be viewed as 
a tuning parameter. For example, in their histogram-based image classification 
project using SVM, [5] reported that a — and a = 0.5 achieved good perfor- 
mance. For heavy-tailed data, tuning a has the similar effect as term-weighting 
the original data, often a critical step in a lot of applications |7l8j . 

For popular kernel SVM solvers including the Sequential Minimal Optimiza- 
tion (SMO) algorithm [9 , storing and computing kernels is the major bottleneck. 
Three computational challenges were summarized in [4j page 12]: 

Computing kernels is expensive 

Computing full kernel matrix is wasteful Efficient SVM solvers 

often do not need to evaluate all pairwise kernels. 

Kernel matrix does not fit in memory Storing the kernel matrix 
at the memory cost 0(n 2 ) is challenging when n > 10 5 , and is not realistic 
for n > 10 6 , because O (10 12 ) consumes at least 1000 GBs memory. 

A popular strategy in large-scale learning is to evaluate distances on the 
fly [4]. That is, instead of loading the similarity matrix in memory at the cost of 
0(n 2 ), one can load the original data matrix at the cost of 0(nD) and recompute 
pairwise distances on-demand. This strategy is apparently problematic when D 



3 



is not too small. For high-dimensional data, either loading the data matrix in 
memory is unrealistic or computing distances on-demand becomes too expensive. 

Those challenges are not unique to kernel SVM; they are general issues in 
distanced-based learning algorithms. The method of stable random projections 
provides a promising scheme by reducing the dimension D to a small k (e.g., 
k = 50), to facilitate compact data storage and efficient distance computations. 

1.3 Stable Random Projections 

The basic procedure of stable random projections is to multiply A £ M. nxD 
by a random matrix R £ IR- Dxfc (k <C D), which is generated by sampling each 
entry ry i.i.d. from a symmetric stable distribution S(a, 1). The resultant matrix 
B = A x R G M. nxk is much smaller than A and hence it may fit in memory. 

Suppose a stable random variable x ~ S(a, d), where d is the scale parameter. 
Then its characteristic function (Fourier transform of the density function) is 

E (exp {yf^lxt)) = exp (-d\t\ a ) , 

which does not have a closed-form inverse except for a — 2 (normal) or a = 1 
(Cauchy). Note that when a — 2, d corresponds to "cr 2 " (not "er") in a normal. 

Corresponding to the leading two rows in A, iti, U2 £ M. D , the leading two 
rows in B are v\ = R T iti, t>2 = R T U2- The entries of the difference, 

D / D 

i=l \ i=l 

for j = 1 to fc, are i.i.d. samples from a stable distribution with the scale pa- 
rameter being the l a distance d( Q ), due to properties of Fourier transforms. For 
example, when a — 2, a weighted sum of i.i.d. standard normals is also normal 
with the scale parameter (i.e., variance) being the sum of squares of all weights. 

Once we obtain the stable samples, one can discard the original matrix A 
and the remaining task is to estimate the scale parameter rf( Q ) for each pair. 

Some applications of stable random projections are summarized as follows: 

- Computing all pairwise distances The cost of computing all pair- 
wise distances of A e R rix15 , 0(n 2 D), is significantly reduced to 0(nDk+n 2 k). 

- Estimating l a distances online For n > 10 5 , it is challenging or 
unrealistic to materialize all pairwise distances in A. Thus, in applications 
such as online learning, databases, search engines, and online recommenda- 
tion systems, it is often more efficient if we store B 6 M. nxk in the memory 
and estimate any distance on the fly if needed. Estimating distances online 
is the standard strategy in large-scale kernel learning[4]. With stable random 
projections, this simple strategy becomes effective in high-dimensional data. 

- Learning with dynamic streaming data In reality, the data ma- 
trix may be updated overtime. In fact, with streaming data arriving at high- 
rate PHO], the "data matrix" may be never stored and hence all operations 
(such as clustering and classification) must be conducted on the fly. The 
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method of stable random projections provides a scheme to compute and up- 
date distances on the fly in one-pass of the data; see relevant papers (e.g., 
PQ) for more details on this important and fast-developing subject. 

- Estimating entropy The entropy distance Ylf=i iia,»| log U2,i \ is 
a useful statistic. A workshop in NIPS'03 (www.menem. com/~ilya/pages/NIPS03 ) 
focused on entropy estimation. A recent practical algorithm is simply us- 
ing the difference between the l ai and l a2 distances [11], where a\ = 1.05, 
0L2 — 0.95, and the distances were estimated by stable random projections. 

If one tunes the l a distances for many different a (e.g., [5]), then stable 
random projections will be even more desirable as a cost-saving device. 

2 The Statistical Estimation Problem 

Recall that the method of stable random projections boils down to a statistical 
estimation problem. That is, estimating the scale parameter from k i.i.d. 
samples Xj ~ S(a, d( Q )), j = 1 to k. We consider that a good estimator d( a ) 
should have the following desirable properties: 

- (Asymptotically) unbiased and small variance. 

- Computationally efficient. 

- Exponential decrease of error (tail) probabilities. 

The arithmetic mean estimator A E =i \xj\' 2 is good for a = 2. When a < 2, 
the task is less straightforward because (1) no explicit density of Xj exists unless 
a = 1 or 0+; and (2) E(|xj|*) < oo only when — 1 < t < a. 

2.1 Several Previous Estimators 

Initially reported in arXiv in 2006, [5] proposed the geometric mean estimator 

n k \ rr -.\ a / k 

[f^(tK(i-i)^(lt)] 

where -T(.) is the Gamma function, and the harmonic mean estimator 

? -fr(-a) sin (fa) / / -gT(-2a) sin (tto) 

= ; — ; « — 7 — ; ; : ttt — 1 

E-=xM- a V V [r(-a)sin(fa)] 2 

More recently, [3] proposed the fractional power estimator 

v-^fc I lA*a \ 1 / A " 

dr, 



k |r(l-A*)r(A*a)sin(fA*a) 



where 



i _ lj_ /J_ _ A / |r(l-2A*)r(2A-"Q)sin(^A*Q) _ x 
fc2A* ^A* J \ v [|r(l-A*)r(A*a)sin(fA*a)] 2 

1 / ^r(l - 2A)r(2Aa)sin(7rAa) 
A = argmin - 1 — 



i,A<i A2 V[§ni-A)r(Aa)sin(fAa)]' 




0.20.40.60.8 1 1.21.41.61.8 2 

a 

Fig. 1. The Cramer- Rao efficiencies (the higher the better, max = 100%) of 
various estimators, including the optimal quantile estimator proposed in this 
study. 

All three estimators are unbiased or asymptotically (as k — * oo) unbiased. 
Figure [1] compares their asymptotic variances in terms of the Cramer- Rao effi- 
ciency, which is the ratio of the smallest possible asymptotic variance over the 
asymptotic variance of the estimator, as k — > oo. 

The geometric mean estimator, di a \ gm exhibits tail bounds in exponential 
forms, i.e., the errors decrease exponentially fast: 



Pr Id 



1(a), gm 



> ed( a ) J < 2 exp 



The harmonic mean estimator, dr a \hm> works well for small a, and has ex- 
ponential tail bounds for a = 0+. 

The fractional power estimator, di a yf p , has smaller asymptotic variance than 
both the geometric mean and harmonic mean estimators. However, it docs not 
have exponential tail bounds, due to the restriction — 1 < A* a < a in its definition. 
As shown in 3 , it only has finite moments slightly higher than the 2nd order, 
when a approaches 2 (because A* — > 0.5), meaning that large errors may have a 
good chance to occur. We will demonstrate this by simulations. 



2.2 The Issue of Computational Efficiency 



In the definitions of d, 



(a),gr. 



^(a),/ir 



and d 



all three estimators require 



evaluating fractional powers, e.g., \xj\ a / k . This operation is relatively expensive, 
especially if we need to conduct this tens of billions of times (e.g., n 2 = 10 10 ). 

For example, [5] reported that, although the radial basis kernel ^ with 
a = 0.5 achieved good performance, it was not preferred because evaluating the 
square root was too expensive. 

2.3 Our Proposed Estimator 

We propose the optimal quantile estimator, using the <?*th smallest \xj\: 



d 



(a),oq 



(q*-quimti\e{\xj\,j = 1,2, ...,k}) a 



(3) 



where q* = q*(a) is chosen to minimize the asymptotic variance. 
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This estimator is computationally attractive because selecting should be 
much less expensive than evaluating fractional powers. If we are interested in 



d 1 /? instead, then we do not even need to evaluate any fractional powers. 



As mentioned, in many cases using either or d x J? makes no difference 
and dtoA is often preferred because it avoids taking (.) 1 ' Q power. The radial basis 
kernel ([2|) requires Thus this study focuses on On the other hand, if 
we can estimate d 1 /? directly, for example, using §3§ without the ath power, we 

might as well just use d 1 ^ if permitted. In case we do not need to evaluate any 
fractional power, our estimator will be even more computationally efficient. 

In addition to the computational advantages, this estimator also has good 
theoretical properties, in terms of both the variances and tail probabilities: 

1. Figure [T] illustrates that, compared with the geometric mean estimator, its 
asymptotic variance is about the same when a < 1, and is considerably 
smaller when a > 1. Compared with the fractional power estimator, it has 
smaller asymptotic variance when 1 < a < 1.8. In fact, as will be shown by 
simulations, when the sample size k is not too large, its mean square errors 
are considerably smaller than the fractional power estimator when a > 1. 

2. The optimal quantile estimator exhibits tail bounds in exponential forms. 
This theoretical contribution is practically important, for selecting the sam- 
ple size k. In learning theory, the generalization bounds are often loose. In 
our case, however, the bounds are tight because the distribution is specified. 

The next section will be devoted to analyzing the optimal quantile estimator. 

3 The Optimal Quantile Estimator 

Recall the goal is to estimate d( Q ) from {xj}j =1 , where Xj ~ S(a, d(o,)), i.i.d. Since 
the distribution belongs to the scale family, one can estimate the scale parameter 
from quantiles. Due to symmetry, it is natural to consider the absolute values: 

- _ / g-QuantilelJzjl, j = l,2,...,fc} \° 
{a) ' q \ g-Quairfcile{ \S(a, 1)|} J ' (> 

which is best understood by the fact that if x ~ S(a, 1), then d^^x ~ S(a,d), or 
more obviously, if x ~ iV(0, 1), then (o- 2 ) 1/2 x ~ N (0,ct 2 ). By properties of order 
statistics |12j . any g-quantile will provide an asymptotically unbiased estimator. 
Lemma [T] provides the asymptotic variance of d( a ),q- 

Lemma 1. Denote fx (x;a,d( a )) and Fx (x;a,d( a 'f) the probability density func- 
tion and the cumulative density function of X ~ S(a,d( a )), respectively. 
The asymptotic variance of d^ q defined in is 

where W = F^ 1 ((q + l)/2; a, 1) = q-Quantile{\S(a, 1)|}. 
Proof: See Appendix\fi\ □. 
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3.1 Optimal Quantile q*(ot) 

We choose q — q*(a) so that the asymptotic variance ((5]) is minimized, i.e., 

_ 2 

q*(a) = argmin g(q;a), g(q;a) = f2 (y^- a 1)W 2 ' ^ 

The convexity of g(q] a) is important. Graphically, g(q; a) is a convex function 
of q, i.e., a unique minimum exists. An algebraic proof, however, is difficult. 
Nevertheless, we can obtain analytical solutions when a = 1 and a = 0+. 

Lemma 2. When a = 1 or a = 0+. the junction g(q; a) defined in {6p is a 
convex function of q. When a = 1, the optimal q*(l) — 0.5. When a = 0+, 
<f (0+) = 0.203 is the solution to ~\ogq* + 2q* - 2 = 0. 
Proof: See AvvendixWl □. 

It is also easy to show that when a = 2, g* (2) = 0.862. 

We denote the optimal quantile estimator by d( Q ) i0g , which is same as d( a \ q * . 
For general a, we resort to numerical solutions, as presented in Figure [5] 




0.20.40.60.8 1 1.21.41.61.8 2 0.20.40.60.8 1 1.21.41.61.8 2 



a a 
(a) q* (b) W a (q*) 

Fig. 2. (a) The optimal values for q*(a), which minimizes asymptotic vari- 
ance of d,t a \ q , i.e., the solution to ©. (b) The constant W a (q*) = 
{g*-quantile{|5(a,l)|}} Q . 

3.2 Bias Correction 

Although di a ), q (i- e -i d(a),q") is asymptotically (as k — > oo) unbiased, it is 
seriously biased for small k. Thus, it is practically important to remove the bias. 
The unbiased version of the optimal quantile estimator is 

d(a),oq,c = d(a),og/-Ba,)b, (7) 

where B a ^ is the expectation of d(a),oq & t ^(a) = 1- For a = 1, 0+, or 2, 
we can evaluate the expectations (i.e., integrals) analytically or by numerical 
integrations. For general a, as the probability density is not available, the task 
is difficult and prone to numerical instability. On the other hand, since the 
Monte-Carlo simulation is a popular alternative for evaluating difficult integrals, 
a practical solution is to simulate the expectations, as presented in Figure [31 

Figure [3J illustrates that B a ,k > 1, meaning that this correction also reduces 
variance while removing bias (because Var(a:/c) = Var(a;)/c 2 ). For example, when 




a = 0.1 and k = 10, B a ^ « 1.24, which is significant, because 1.24 2 = 1.54 
implies a 54% difference in terms of variance, and even more considerable in 
terms of the mean square errors MSE = variance + bias 2 . 

B a ^ can be tabulated for small k, and absorbed into other coefficients, i.e., 
this does not increase the computational cost at run time. We fix B a ^ as reported 
in Figure [3] The simulations in Section [4] directly used those fixed B a ^ values. 

3.3 Computational Efficiency 

Figure [4] compares the computational costs of the geometric mean, the fractional 
power, and the optimal quantile estimators. The harmonic mean estimator was 
not included as it costs very similarly to the fractional power estimator. 

We used the build-in function "pow" in gec for evaluating the fractional pow- 
ers. We implemented a "quick select" algorithm, which is similar to quick sort 
and requires on average linear time. For simplicity, our implementation used 
recursions and the middle element as pivot. Also, to ensure fairness, for all esti- 
mators, coefficients which are functions of a and/or k were pre-computed. 




Fig. 4. Relative computational cost (d( a ),<?m over d( a ),o 9 , c and d( a ) iSm over d( a )jp), 
from 10 6 simulations at each combination of a and k. The left panel averages 
over all k and the right panel averages over all a. Note that the cost of d( Q ), oq , c 
includes evaluating the ath moment once. 

Normalized by the computing time of d( a ), gm , we observe that relative com- 
putational efficiency does not strongly depend on a. We do observe that the ratio 
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of computing time of d( a ) igm over that of d/ a \ oqiC increases consistently with in- 
creasing k. This is because in the definition of d^ a )„ q (and hence also it 
is required to evaluate the fractional power once, which contributes to the total 
computing time more significantly at smaller k. 

Figure [4] illustrates that, (A) the geometric mean estimator and the frac- 
tional power estimator are similar in terms of computational efficiency; (B) the 
optimal quantile estimator is nearly one order of magnitude more computation- 
ally efficient than the geometric mean and fractional power estimators. Because 
we implemented a "naive" version of "quick select" using recursions and simple 
pivoting, the actual improvement may be more significant. Also, if applications 
require only d 1 /?, then no fractional power operations are needed for d^ >oq>c and 
the improvement will be even more considerable. 

3.4 Error (Tail) Bounds 

Error (tail) bounds are essential for determining k. The variance alone is not 
sufficient for that purpose. If an estimator of d, say d, is normally distributed, d ~ 
N (d, tV), the variance suffices for choosing k because its error (tail) probability 
Pr (jd- d\ > ed^j < 2exp (—k^J is determined by V. In general, a reasonable 
estimator will be asymptotically normal, for small enough e and large enough k. 
For a finite k and a fixed e, however, the normal approximation may be (very) 
poor. This is especially true for the fractional power estimator, d( a )j p . 

Thus, for a good motivation, Lemma [3] provides the error (tail) probability 
bounds of di a \ q for any q, not just the optimal quantile q*. 

Lemma 3. Denote X ~ S(a,d^) and its probability density function by fx{x;a,d^ 
and cumulative function by Fx{x;a, d( a )). Given Xj ~ S(a,d( a )), i.i.d., j — 1 to 
k. Using d^ q in Hty, then 

2 



G 

e 



R.q 

2 



Pr (d(a), g > (1 + e)d(a)) ^ ex P \ k (^R~) ,e> °' ® 

Pr (d (a) , q < (1 - e)d (a) ) < exp (^-k-^—^j , < e < 1, (9) 

-(1 - q) log (2 - 2F R ) - glog(2F J j - 1) + (1 - q) log(l - q) + glogg, (10) 

-(1-q) log (2 - 2F L ) - glog(2F L - 1) + (1 - q) log(l - q) + glog g, (11) 



w = F^iil + l)/2; a, 1) = q-quanUle{\S(a, 1)|}, 



Fr — Fx 
As e ^ 



((1 + e) 1/a W; a, l) , F L = F x ((1 - e) 1/a W; a, l) 



lim Gfl, q = lim Gh,q = g) f/ T L ■ (12) 

f^o+ E ^o+ /| (VF; a, l)W 2 



Proof: See Appendix^ □ 
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The limit in (|12j) as e — > is precisely twice the asymptotic variance factor of 
d( a ),q m ©j consistent with the normality approximation mentioned previously. 
This explains why we express the constants as e 2 /G. (fT2f also indicates that the 
tail bounds achieve the "optimal rate" for this estimator, in the language of large 
deviation theory. 

By the Bonferroni bound, it is easy to determine the sample size k 
Pr(\d (aU -d (a) \ >«Z {Q) ) <2exp(-*^ < */(n a /2) =>■ k > ^ (2 log n - log 5) . 

Lemma 4. Using d{ a ),q with k > ^ (21ogn — log<5), any pairwise l a distance 
among n points can be approximated within a lie factor with probability > 1 — 8. 
It suffices to let G = max{Gjj, 9 , Gh, q }, where Gn, q , Gh, q are defined in Lemma\^ 

The Bonferroni bound can be unnecessarily conservative. It is often reason- 
able to replace <5/(n 2 /2) by 6/T, meaning that except for a l/T fraction of pairs, 
any distance can be approximated within a 1 ± e factor with probability 1 — 5. 

Figure [5] plots the error bound constants for e < 1, for both the recom- 
mended optimal quantile estimator dt a \ oq and the baseline sample median esti- 
mator d( a ),q=o.5- Although we choose di a ) )0q based on the asymptotic variance, 
it turns out d^ a )^ oq also exhibits (much) better tail behaviors (i.e., smaller con- 
stants) than d( a ),(j=o.5j at least in the range of e < 1. 




Consider k = % (log2T — log S) (recall we suggest replacing n 2 /2 by T), with 
S = 0.05, e = 0.5, and T — 10. Because G_r )(J » w5~9 around e = 0.5, we obtain 
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k 120 ~ 215, which is still a relatively large number (although the original 
dimension D might be 10 6 ). If we choose e = 1, then approximately k « 40 ~ 65. 

It is possible k = 120 ~ 215 might be still conservative, for three reasons: 
(A) the tail bounds, although "sharp," are still upper bounds; (B) using G = 
max{G/? i9 » , Gl, 9 * } is conservative because Gr,,q* is usually much smaller than 
Gfl„ 9 *; (C) this type of tail bounds is based on relative error, which may be 
stringent for small (« 0) distances. 

In fact, some earlier studies on normal random projections (i.e., a — 2) [13|14j 
empirically demonstrated that k > 50 appeared sufficient. 

4 Simulations 

We resort to simulations for comparing the finite sample variances of various 
estimators and assessing the more precise error (tail) probabilities. 

One advantage of stable random projections is that we know the (manually 
generated) distributions and the only source of errors is from the random number 
generations. Thus, we can simply rely on simulations to evaluate the estimators 
without using real data. In fact, after projections, the projected data follow 
exactly the stable distribution, regardless of the original real data distribution. 

Without loss of generality, we simulate samples from S(a, 1) and estimate the 
scale parameter (i.e., 1) from the samples. Repeating the procedure 10 7 times, 
we can reliably evaluate the mean square errors (MSE) and tail probabilities. 

4.1 Mean Square Errors (MSE) 

As illustrated in Figure [6l in terms of the MSE, the optimal quantile estimator 
d( a ),oq,c outperforms both the geometric mean and fractional power estimators 
when a > 1 and k > 20. The fractional power estimator does not appear to be 
very suitable for a > 1, especially for a close to 2, even when the sample size k is 
not too small (e.g., k = 50). For a < 1, however, the fractional power estimator 
has good performance in terms of MSE, even for small k. 

4.2 Error(Tail) Probabilities 

Figure [Jj presents the simulated right tail probabilities, Pr (d( Q ) > (l + e)d( a )j, 
illustrating that when a > 1, the fractional power estimator can exhibit very 
bad tail behaviors. For a < 1, the fractional power estimator demonstrates good 
performance at least for the probability range in the simulations. 

Thus, Figure [7] demonstrates that the optimal quantile estimator consistently 
outperforms the fractional power and the geometric mean estimators when a > 1 . 

5 The Related Work 

There have been many studies of normal random projections in machine learning, 
for dimension reduction in the li norm, e.g., |14j . highlighted by the Johnson- 
Lindenstrauss (JL) Lemma |15| . which says k = O (logn/e 2 ) suffices when using 
normal (or normal-like, e.g., [TB]) projection methods. 

The method of stable random projections is applicable for computing the 
l a distances (0 < a < 2), not just for l 2 - [U Lemma 1, Lemma 2, Theorem 3] 
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Fig. 6. Empirical mean square errors (MSE, the lower the better), from 10 7 
simulations at every combination of a and k. The values are multiplied by k 
so that four plots can be at about the same scale. The MSE for the geometric 
mean (gm) estimator is computed exactly since closed-form expression exists. 
The lower dashed curves are the asymptotic variances of the optimal quantile 
(oq) estimator. 



suggested the median (i.e., q — 0.5 quantile) estimator for a = 1 and argued that 
the sample complexity bound should be O (l/e 2 ) (n = 1 in their study). Their 
bound was not provided in an explicit form and required an "e is small enough" 
argument. For a^l, 1, Lemma 4] only provided a conceptual algorithm, which 
"is not uniform." In this study, we prove the bounds for any g-quantile and any 
< a < 2 (not just a = 1), in explicit exponential forms, with no unknown 
constants and no restriction that "e is small enough." 

The quantile estimator for stable distributions was proposed in statistics 
quite some time ago, e.g., |17|18j . [17] mainly focused on 1 < a < 2 and rec- 
ommended using q = 0.44 quantiles (mainly for the sake of smaller bias). [18) 
focused on 0.6 < a < 2 and recommended q = 0.5 quantiles. 

This study considers all < a < 2 and recommends q based on the minimum 
asymptotic variance. Because the bias can be easily removed (at least in the 
practical sense) , it appears not necessary to use other quantiles only for the sake 
of smaller bias. Tail bounds, which are useful for choosing q and k based on 
confidence intervals, were not available in |17|18j . 

Finally, one might ask if there might be better estimators. For a = 1, [19] pro- 
posed using a linear combination of quantiles (with carefully chosen coefficients) 
to obtain an asymptotically optimal estimator for the Cauchy scale parameter. 
While it is possible to extend their result to general < a < 2 (requiring some 
non-trivial work), whether or not it will be practically better than the optimal 
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Fig. 7. The right tail probabilities (the lower the better), from 10 7 simulations 
at each combination of a and k. 

quantile estimator is unclear because the extreme quantiles severely affect the 
tail probabilities and finite-sample variances and hence some kind of truncation 
(i.e., discarding some samples at extreme quantiles) is necessary. Also, exponen- 
tial tail bounds of the linear combination of quantiles may not exist or may not 
be feasible to derive. In addition, the optimal quantile estimator is computation- 
ally more efficient. 

6 Conclusion 

Many machine learning algorithms operate on the training data only through 
pairwise distances. Computing, storing, updating and retrieving the "matrix" of 
pairwise distances is challenging in applications involving massive, high-dimensional, 
and possibly streaming, data. For example, the pairwise distance matrix can not 
fit in memory when the number of observations exceeds 10 6 (or even 10 5 ). 

The method of stable random projections provides an efficient mechanism for 
computing pairwise distances using low memory, by transforming the original 
high-dimensional data into sketches, i.e., a small number of samples from in- 
stable distributions, which are much easier to store and retrieve. 

This method provides a uniform scheme for computing the l a pairwise dis- 
tances for all < a < 2. Choosing an appropriate a is often critical to the 
performance of learning algorithms. In principle, we can tune algorithms for 
many l a distances; and stable random projections can provide an efficient tool. 

To recover the original distances, we face an estimation task. Compared with 
previous estimators based on the geometric mean, the harmonic mean, or the 
fractional power, the proposed optimal quantile estimator exhibits two advan- 
tages. Firstly, the optimal quantile estimator is nearly one order of magnitude 
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more efficient than other estimators (e.g., reducing the training time from one 
week to one day) . Secondly, the optimal quantile estimator is considerably more 
accurate when a > 1, in terms of both the variances and error (tail) proba- 
bilities. Note that a > 1 corresponds to a convex norm (satisfying the triangle 
inequality), which might be another motivation for using l a distances with a > 1. 

One theoretical contribution is the explicit tail bounds for general quantile 
estimators and consequently the sample complexity bound k = 0(logn/e 2 ). 
Those bounds may guide practitioners in choosing fc, the number of projections. 
The (practically useful) bounds are expressed in terms of the probability func- 
tions and hence they might be not as convenient for further theoretical analysis. 
Also, we should mention that the bounds do not recover the optimal bound of 
the arithmetic mean estimator when a = 2, because the arithmetic mean es- 
timator is statistically optimal at a — 2 but the optimal quantile estimator is 
not. 

While we believe that applying stable random projections in machine learning 
has become straightforward, there are interesting theoretical issues for future 
research. For example, how theoretical properties of learning algorithms may be 
affected if the approximated (instead of exact) l a distances are used? 



A Proof of Lemma [T] 

Denote fx (x; a, rf( Q )) and Fx (a;; a, g?( q )) the probability density function and 
the cumulative density function of X ~ S(a, (i( a )), respectively. Similarly we use 
fz (z; a, d( Q )) and Fz (z; a, d( a )) for Z = \X\. Due to symmetry, the following 
relations hold 

fz (z;a,d (a) ) = 2f x {z;a,d {a) ) = 2/d^fx (z/d^°; a, l) , 

F z (z; a, d (a) ) = 2F X (z; a, d (a) ) - 1 = 2F X (z/d^; a, l) - 1, 

F z 1 (q; a, d (a) ) = F^ ((q + l)/2; a, d (a) ) = d^F* 1 ((q + l)/2; a, 1) . 

Let W = g-Quantile{|5'(Q, 1)|} = F x x ((q + l)/2; a, 1) and W d = F z x (q; a, d( a )) = 
dVjtW. Then, following known statistical results, e.g., [T2J, Theorem 9.2], the 
asymptotic variance of da(q should be 

Varfe) * + 0(TU1 q ~ q2 . + ofl 



f*(W d ;a,d (a) )W* W k d -V<* f 2(W;a,l)W 



k 2 



k4d^ /a f x (W;a,l)W2 U 
By "delta method," i.e., Vax(h(x)) « Var(:r) (h'{x)) 2 , 

v „ („-..„) (*.,) «■"■)' + o fe) = y fcffffi, *. + o (±[ 
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B Proof of Lemma [2] 

First, consider a = 1. In this case, 

f x (x; 1, 1) = ~^—r, W = ((g + l)/2; 1, 1) = tan (^q) , 

7T X + 1 \ 2 / 

q-q 2 q-q 2 2 



9W 1) = x^2 = ,-:^2/_„N 7r 

T tan 2 (^q) + l 



tan 2 (§< 



'fa) 



It suffices to study L(g) = logg(g; 1). 



,. , 1 1 27rcos(7rg) „, , 1 
L (<?) = - - i r - „,,,_^ ■ L (l) =—3- 71 + 



g 1 — g sin(7rg) ' g 2 (1 — g) 2 sin (7rg) 

Because sin(x) < x for x > 0, it is easy to see that . 7 — - > 0, and 
1 - 1 > 0. Thus, L" > 0, i.e., L(q) is convex and 



sin(7rg) 1 — q sin(7r(l — q)) 1 — q 

so is g(q; 1) = e L ^\ Since L'(l/2) = 0, wc know g*(l) = 0.5. 

Next we consider a = 0+, using a fact [5] that as a — ► 0+, \S(a, l)\ a con- 
verges to I /Ei, where E\ stands for an exponential distribution with mean 1. 
Denote h = d(o+) and Zj ~ h/Ei. The sample quantile estimator becomes 

- g-Quantile{|jZj|,j = 1,2, ...,k] 

«(o+),s - 



In this case, 



g-Quantile{ } 
fz(z;h) = e~ h/z -!\, Fz 1 (q;h) = -- h 



log Q 

Var(d (0+ ), 9 )-I^^ /l2 + °(F 
It is straightforward to show that }~% is a convex function of q and the 

q log z g ^ 

minimum is attained by solving — logg* + 2q* — 2 = 0, i.e., q* = 0.203. 
C Proof of Lemma [3] 

Given k i.i.d. samples, Xj ~ S(a, d( a )), j = 1 to fc. Let Zj = \xj\, j = 1 to fc. 
Denote by .Fz(t; a, <i( Q )) the cumulative density of Zj, and by Fz,k(t] a, the 
empirical cumulative density of Zj, j = 1 to k. 

It is the basic fact [12] about order statistics that kFz,h (t; a, d( a \ ) follows a 
binomial, i.e., kFz.k(t] a, ^(q)) ~ Bin{k, Fz{t\ a, <i( Q )))- For simplicity, we replace 
F z {t;a,d {a) ) by F(t,d), F Zik {t;a,d( a) ) by F k (t,d), and d (ct) by rf, in this proof. 

Using the original binomial Chernoff bounds [20| , we obtain, for e' > 0, 

Pr (fcF fc (t; d) > (1 + e')kF (t; d)) 

k - kF(t; d) \ k-Hi+t')F(t;d) , kF{t . d) x (nV)*f (*!*) 
- (1 + e')kF(t- d)J + e')kF(t-d)) 

l_(l + e ')F(t;d) / , v (l + e ')F(t;d)] fc 



/ 1-F(t;d) V 

U - (i + e')F(t;d),' 



1 + e' 
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and for < e' < 1, 



Pr (kF k (t;d) < (1 - e')kF(t;d)) 

1 — F(t;d) \ i-Ci-e'^C*;* / i \ (i-«')-P(*i<«) 
1 - (1 - e') F (*; d ) 



Consider the general quantile estimator dr a \ q defined in For e > 0, 
(again, denote W = g-quantile{| S^a, 1)|}), 

Pr (d(c), s > (1 + = Pr (q-quantilc{|z., |}) > ((1 + e)d) 1/a W0 

=Pr (kF k ((1 + e) 1/Q W; l) < gfe) = Pr (kF k (t: 1) < (1 - e')kF{t; 1)) , 

where t = (1 + e) 1/a TV and g = (1 - e')F(i; 1). Thus 

Pr (<*(«),„ > (l + e)d) 

1 - F (((1 + e)) 1 /" W; l) \ 1_S / F (((1 + e)) 1 /" W; l) 



exp I — k 



where 



Gr a 



-(l-g)log(l-.F((l + e) 1/a W;l)) 



- glog [F ((1 + e) 1/a W; l)) + (1 - q) log(l - g) + glog(g). 
For < e < 1, 

Pr (d(„),, < (1 - = Pr ((1 - W"; l) > 1 k ) = Pr 1)>(1 + e')*^(*i *)) . 

where t = (1 - e) 1/Q W and g = (1 + e / )F(t; 1). Thus, 



Pr(d (Q) ,, < (l-e)d) 

1 - F ((1 - e) 1/Q W; l) 



1 ((1 - e) 1/Q W; l) 



■ exp [ — k 



Gl, 



where 



Gl., 



-(l-g)log(l-F((l-e) 1/a W;l)) 



glog (F ((1 - e) 1/a W;l)) + (1 - g) log(l - g) + glog(g). 
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Denote f(t; d) = F'(t; d). Using L'Hospital's rule 

-(1 - q) log (l-F ((1 + e) 1/a W; l)) 



1 

lim — lim 



+ 



— lim 



-q log (F ((1 + <0 1/Q W; l) ) + (1 - q) log(l - q) + q log(g) 

/ ((1 + c) 1 /" W; l) ^ (1 + c) 1 /"- 1 F ((1 + e) 1 /" W; l) - q 



= 1 



- 0+ F ((1 + e) 1/Q VK; l) (l - F ((1 + e) 1 /" W; l) ) 
( / ( (1 + e) i/» W;1 )^ (1 + e) i/=-i) 2 



0+ 2F ((1 + e) 1 /" VK; l) (l - F ((1 + e) 1 /" W; l)) 
f 2 (W- 1) IV 2 



2?(1 - 9)" 2 

Similarly 



(g = F(W,l)). 

2q(l-q)a 2 



lim Gl,„ = 



/ 2 (W;l) W 2 ' 

To complete the proof, apply the relations on Z = \X\ in the proof of Lemma [TJ 
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Abstract. Th<0 method of stable random projections is a tool for effi- 
ciently computing the l a distances using low memory, where < a < 2 
is a tuning parameter. The method boils down to a statistical estimation 
task and various estimators have been proposed, based on the geometric 
mean, the harmonic mean, and the fractional power etc. 
This study proposes the optimal quantile estimator, whose main op- 
eration is selecting, which is considerably less expensive than taking 
fractional power, the main operation in previous estimators. Our exper- 
iments report that the optimal quantile estimator is nearly one order 
of magnitude more computationally efficient than previous estimators. 
For large-scale learning tasks in which storing and computing pairwise 
distances is a serious bottleneck, this estimator should be desirable. 
In addition to its computational advantages, the optimal quantile estima- 
tor exhibits nice theoretical properties. It is more accurate than previous 
estimators when a > 1. We derive its theoretical error bounds and es- 
tablish the explicit (i.e., no hidden constants) sample complexity bound. 

1 Introduction 

The method of stable random vroiections \ll2l'3\ , as an efficient tool for comput- 
ing pairwise distances in massive high-dimensional data, provides a promising 
mechanism to tackle some of the challenges in modern machine learning. In this 
paper, we provide an easy-to-implement algorithm for stable random projections 
which is both statistically accurate and computationally efficient. 

1.1 Massive High-dimensional Data in Modern Machine Learning 

We denote a data matrix by A S R™ x£) , i.e., n data points in D dimensions. 
Data sets in modern applications exhibit important characteristics which impose 
tremendous challenges in machine learning [3]: 

- Modern data sets with n = 10 5 or even n = 10 6 points are not uncommon in 
supervised learning, e.g., in image/text classification, ranking algorithms for 
search engines, etc. In the unsupervised domain (e.g., Web clustering, ads 
clickthroughs, word/term associations), n can be even much larger. 

- Modern data sets are often of ultra high-dimensions (D), sometimes in the 
order of millions (or even higher), e.g., image, text, genome (e.g., SNP), etc. 
For example, in image analysis, D may be 10 3 x 10 3 = 10 6 if using pixels as 
features, or D = 256 3 « 16 million if using color histograms as features. 

- Modern data sets are sometimes collected in a dynamic streaming fashion. 

1 First draft Feb. 2008, slightly revised in June 2008. The results were announced in 
January 2008 at SODA'08 when the author presented the work of [2]. 
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- Large-scale data are often heavy-tailed, e.g., image and text data. 

Some large-scale data are dense, such as image and genome data. Even for 
data sets which are sparse, such as text, the absolute number of non-zeros may 
be still large. For example, if one queries "machine learning" (a not-too-common 
term) in Google.com, the total number of pagehits is about 3 million. In other 
words, if one builds a term-doc matrix at Web scale, although the matrix is 
sparse, most rows will contain large numbers (e.g., millions) of non-zero entries. 

1.2 Pairwise Distances in Machine Learning 

Many learning algorithms require a similarity matrix computed from pairwise 
distances of the data matrix A £ R TlXjD . Examples include clustering, nearest 
neighbors, multidimensional scaling, and kernel SVM (support vector machines). 
The similarity matrix requires 0{n 2 ) storage space and 0(n 2 D) computing time. 

This study focuses on the l a distance (0 < a < 2). Consider two vectors u%, 
U2 6 M. D (e.g., the leading two rows in A), the l a distance between u\ and ui is 

D 

d( a ) =^2\u!a — U-2,i\ a ■ (1) 

Note that, strictly speaking, the l a distance should be defined as dV?. Be- 
cause the power operation (.) 1/a is the same for all pairs, it often makes no 
difference whether we use d 1 ^ or just d( Q ); and hence we focus on <i( a ). 

The radial basis kernel (e.g., for SVM) is constructed from d^ a ) |5l6j : 

K(wi,u 2 ) =exp -ui,i| a V 0<a<2. (2) 

When a — 2, this is the Gaussian radial basis kernel. Here a can be viewed as 
a tuning parameter. For example, in their histogram-based image classification 
project using SVM, [5] reported that a — and a = 0.5 achieved good perfor- 
mance. For heavy-tailed data, tuning a has the similar effect as term-weighting 
the original data, often a critical step in a lot of applications |7l8j . 

For popular kernel SVM solvers including the Sequential Minimal Optimiza- 
tion (SMO) algorithm [9 , storing and computing kernels is the major bottleneck. 
Three computational challenges were summarized in [4j page 12]: 

Computing kernels is expensive 

Computing full kernel matrix is wasteful Efficient SVM solvers 

often do not need to evaluate all pairwise kernels. 

Kernel matrix does not fit in memory Storing the kernel matrix 
at the memory cost 0(n 2 ) is challenging when n > 10 5 , and is not realistic 
for n > 10 6 , because O (10 12 ) consumes at least 1000 GBs memory. 

A popular strategy in large-scale learning is to evaluate distances on the 
fly [4]. That is, instead of loading the similarity matrix in memory at the cost of 
0(n 2 ), one can load the original data matrix at the cost of 0(nD) and recompute 
pairwise distances on-demand. This strategy is apparently problematic when D 
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is not too small. For high-dimensional data, either loading the data matrix in 
memory is unrealistic or computing distances on-demand becomes too expensive. 

Those challenges are not unique to kernel SVM; they are general issues in 
distanced-based learning algorithms. The method of stable random projections 
provides a promising scheme by reducing the dimension D to a small k (e.g., 
k = 50), to facilitate compact data storage and efficient distance computations. 

1.3 Stable Random Projections 

The basic procedure of stable random projections is to multiply A £ M. nxD 
by a random matrix R £ IR- Dxfc (k <C D), which is generated by sampling each 
entry ry i.i.d. from a symmetric stable distribution S(a, 1). The resultant matrix 
B = A x R G M. nxk is much smaller than A and hence it may fit in memory. 

Suppose a stable random variable x ~ S(a, d), where d is the scale parameter. 
Then its characteristic function (Fourier transform of the density function) is 

E (exp {yf^lxt)) = exp (-d\t\ a ) , 

which does not have a closed-form inverse except for a — 2 (normal) or a = 1 
(Cauchy). Note that when a — 2, d corresponds to "cr 2 " (not "er") in a normal. 

Corresponding to the leading two rows in A, iti, U2 £ M. D , the leading two 
rows in B are v\ = R T iti, t>2 = R T U2- The entries of the difference, 

D / D 

i=l \ i=l 

for j = 1 to fc, are i.i.d. samples from a stable distribution with the scale pa- 
rameter being the l a distance d( Q ), due to properties of Fourier transforms. For 
example, when a — 2, a weighted sum of i.i.d. standard normals is also normal 
with the scale parameter (i.e., variance) being the sum of squares of all weights. 

Once we obtain the stable samples, one can discard the original matrix A 
and the remaining task is to estimate the scale parameter rf( Q ) for each pair. 

Some applications of stable random projections are summarized as follows: 

- Computing all pairwise distances The cost of computing all pair- 
wise distances of A e R rix15 , 0(n 2 D), is significantly reduced to 0(nDk+n 2 k). 

- Estimating l a distances online For n > 10 5 , it is challenging or 
unrealistic to materialize all pairwise distances in A. Thus, in applications 
such as online learning, databases, search engines, and online recommenda- 
tion systems, it is often more efficient if we store B 6 M. nxk in the memory 
and estimate any distance on the fly if needed. Estimating distances online 
is the standard strategy in large-scale kernel learning[4]. With stable random 
projections, this simple strategy becomes effective in high-dimensional data. 

- Learning with dynamic streaming data In reality, the data ma- 
trix may be updated overtime. In fact, with streaming data arriving at high- 
rate PHO], the "data matrix" may be never stored and hence all operations 
(such as clustering and classification) must be conducted on the fly. The 
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method of stable random projections provides a scheme to compute and up- 
date distances on the fly in one-pass of the data; see relevant papers (e.g., 
PQ) for more details on this important and fast-developing subject. 

- Estimating entropy The entropy distance Ylf=i iia,»| log U2,i \ is 
a useful statistic. A workshop in NIPS'03 (www.menem. com/~ilya/pages/NIPS03 ) 
focused on entropy estimation. A recent practical algorithm is simply us- 
ing the difference between the l ai and l a2 distances [11], where a\ = 1.05, 
0L2 — 0.95, and the distances were estimated by stable random projections. 

If one tunes the l a distances for many different a (e.g., [5]), then stable 
random projections will be even more desirable as a cost-saving device. 

2 The Statistical Estimation Problem 

Recall that the method of stable random projections boils down to a statistical 
estimation problem. That is, estimating the scale parameter from k i.i.d. 
samples Xj ~ S(a, d( Q )), j = 1 to k. We consider that a good estimator d( a ) 
should have the following desirable properties: 

- (Asymptotically) unbiased and small variance. 

- Computationally efficient. 

- Exponential decrease of error (tail) probabilities. 

The arithmetic mean estimator A E =i \xj\' 2 is good for a = 2. When a < 2, 
the task is less straightforward because (1) no explicit density of Xj exists unless 
a = 1 or 0+; and (2) E(|xj|*) < oo only when — 1 < t < a. 

2.1 Several Previous Estimators 

Initially reported in arXiv in 2006, [5] proposed the geometric mean estimator 

n k \ rr -.\ a / k 

[f^(tK(i-i)^(lt)] 

where -T(.) is the Gamma function, and the harmonic mean estimator 

? -fr(-a) sin (fa) / / -gT(-2a) sin (tto) 

= ; — ; « — 7 — ; ; : ttt — 1 

E-=xM- a V V [r(-a)sin(fa)] 2 

More recently, [3] proposed the fractional power estimator 

v-^fc I lA*a \ 1 / A " 

dr, 



k |r(l-A*)r(A*a)sin(fA*a) 



where 



i _ lj_ /J_ _ A / |r(l-2A*)r(2A-"Q)sin(^A*Q) _ x 
fc2A* ^A* J \ v [|r(l-A*)r(A*a)sin(fA*a)] 2 

1 / ^r(l - 2A)r(2Aa)sin(7rAa) 
A = argmin - 1 — 



i,A<i A2 V[§ni-A)r(Aa)sin(fAa)]' 




0.20.40.60.8 1 1.21.41.61.8 2 

a 

Fig. 1. The Cramer- Rao efficiencies (the higher the better, max = 100%) of 
various estimators, including the optimal quantile estimator proposed in this 
study. 

All three estimators are unbiased or asymptotically (as k — * oo) unbiased. 
Figure [1] compares their asymptotic variances in terms of the Cramer- Rao effi- 
ciency, which is the ratio of the smallest possible asymptotic variance over the 
asymptotic variance of the estimator, as k — > oo. 

The geometric mean estimator, di a \ gm exhibits tail bounds in exponential 
forms, i.e., the errors decrease exponentially fast: 



Pr Id 



1(a), gm 



> ed( a ) J < 2 exp 



The harmonic mean estimator, dr a \hm> works well for small a, and has ex- 
ponential tail bounds for a = 0+. 

The fractional power estimator, di a yf p , has smaller asymptotic variance than 
both the geometric mean and harmonic mean estimators. However, it docs not 
have exponential tail bounds, due to the restriction — 1 < A* a < a in its definition. 
As shown in 3 , it only has finite moments slightly higher than the 2nd order, 
when a approaches 2 (because A* — > 0.5), meaning that large errors may have a 
good chance to occur. We will demonstrate this by simulations. 



2.2 The Issue of Computational Efficiency 



In the definitions of d, 



(a),gr. 



^(a),/ir 



and d 



all three estimators require 



evaluating fractional powers, e.g., \xj\ a / k . This operation is relatively expensive, 
especially if we need to conduct this tens of billions of times (e.g., n 2 = 10 10 ). 

For example, [5] reported that, although the radial basis kernel ^ with 
a = 0.5 achieved good performance, it was not preferred because evaluating the 
square root was too expensive. 

2.3 Our Proposed Estimator 

We propose the optimal quantile estimator, using the <?*th smallest \xj\: 



d 



(a),oq 



(q*-quimti\e{\xj\,j = 1,2, ...,k}) a 



(3) 



where q* = q*(a) is chosen to minimize the asymptotic variance. 
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This estimator is computationally attractive because selecting should be 
much less expensive than evaluating fractional powers. If we are interested in 



d 1 /? instead, then we do not even need to evaluate any fractional powers. 



As mentioned, in many cases using either or d x J? makes no difference 
and dtoA is often preferred because it avoids taking (.) 1 ' Q power. The radial basis 
kernel ([2|) requires Thus this study focuses on On the other hand, if 
we can estimate d 1 /? directly, for example, using §3§ without the ath power, we 

might as well just use d 1 ^ if permitted. In case we do not need to evaluate any 
fractional power, our estimator will be even more computationally efficient. 

In addition to the computational advantages, this estimator also has good 
theoretical properties, in terms of both the variances and tail probabilities: 

1. Figure [T] illustrates that, compared with the geometric mean estimator, its 
asymptotic variance is about the same when a < 1, and is considerably 
smaller when a > 1. Compared with the fractional power estimator, it has 
smaller asymptotic variance when 1 < a < 1.8. In fact, as will be shown by 
simulations, when the sample size k is not too large, its mean square errors 
are considerably smaller than the fractional power estimator when a > 1. 

2. The optimal quantile estimator exhibits tail bounds in exponential forms. 
This theoretical contribution is practically important, for selecting the sam- 
ple size k. In learning theory, the generalization bounds are often loose. In 
our case, however, the bounds are tight because the distribution is specified. 

The next section will be devoted to analyzing the optimal quantile estimator. 

3 The Optimal Quantile Estimator 

Recall the goal is to estimate d( Q ) from {xj}j =1 , where Xj ~ S(a, d(o,)), i.i.d. Since 
the distribution belongs to the scale family, one can estimate the scale parameter 
from quantiles. Due to symmetry, it is natural to consider the absolute values: 

- _ / g-QuantilelJzjl, j = l,2,...,fc} \° 
{a) ' q \ g-Quairfcile{ \S(a, 1)|} J ' (> 

which is best understood by the fact that if x ~ S(a, 1), then d^^x ~ S(a,d), or 
more obviously, if x ~ iV(0, 1), then (o- 2 ) 1/2 x ~ N (0,ct 2 ). By properties of order 
statistics |12j . any g-quantile will provide an asymptotically unbiased estimator. 
Lemma [T] provides the asymptotic variance of d( a ),q- 

Lemma 1. Denote fx (x;a,d( a )) and Fx (x;a,d( a 'f) the probability density func- 
tion and the cumulative density function of X ~ S(a,d( a )), respectively. 
The asymptotic variance of d^ q defined in is 

where W = F^ 1 ((q + l)/2; a, 1) = q-Quantile{\S(a, 1)|}. 
Proof: See Appendix\fi\ □. 
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3.1 Optimal Quantile q*(ot) 

We choose q — q*(a) so that the asymptotic variance ((5]) is minimized, i.e., 

_ 2 

q*(a) = argmin g(q;a), g(q;a) = f2 (y^- a 1)W 2 ' ^ 

The convexity of g(q] a) is important. Graphically, g(q; a) is a convex function 
of q, i.e., a unique minimum exists. An algebraic proof, however, is difficult. 
Nevertheless, we can obtain analytical solutions when a = 1 and a = 0+. 

Lemma 2. When a = 1 or a = 0+. the junction g(q; a) defined in {6p is a 
convex function of q. When a = 1, the optimal q*(l) — 0.5. When a = 0+, 
<f (0+) = 0.203 is the solution to ~\ogq* + 2q* - 2 = 0. 
Proof: See AvvendixWl □. 

It is also easy to show that when a = 2, g* (2) = 0.862. 

We denote the optimal quantile estimator by d( Q ) i0g , which is same as d( a \ q * . 
For general a, we resort to numerical solutions, as presented in Figure [5] 




0.20.40.60.8 1 1.21.41.61.8 2 0.20.40.60.8 1 1.21.41.61.8 2 



a a 
(a) q* (b) W a (q*) 

Fig. 2. (a) The optimal values for q*(a), which minimizes asymptotic vari- 
ance of d,t a \ q , i.e., the solution to ©. (b) The constant W a (q*) = 
{g*-quantile{|5(a,l)|}} Q . 

3.2 Bias Correction 

Although di a ), q (i- e -i d(a),q") is asymptotically (as k — > oo) unbiased, it is 
seriously biased for small k. Thus, it is practically important to remove the bias. 
The unbiased version of the optimal quantile estimator is 

d(a),oq,c = d(a),og/-Ba,)b, (7) 

where B a ^ is the expectation of d(a),oq & t ^(a) = 1- For a = 1, 0+, or 2, 
we can evaluate the expectations (i.e., integrals) analytically or by numerical 
integrations. For general a, as the probability density is not available, the task 
is difficult and prone to numerical instability. On the other hand, since the 
Monte-Carlo simulation is a popular alternative for evaluating difficult integrals, 
a practical solution is to simulate the expectations, as presented in Figure [31 

Figure [3J illustrates that B a ,k > 1, meaning that this correction also reduces 
variance while removing bias (because Var(a:/c) = Var(a;)/c 2 ). For example, when 




a = 0.1 and k = 10, B a ^ « 1.24, which is significant, because 1.24 2 = 1.54 
implies a 54% difference in terms of variance, and even more considerable in 
terms of the mean square errors MSE = variance + bias 2 . 

B a ^ can be tabulated for small k, and absorbed into other coefficients, i.e., 
this does not increase the computational cost at run time. We fix B a ^ as reported 
in Figure [3] The simulations in Section [4] directly used those fixed B a ^ values. 

3.3 Computational Efficiency 

Figure [4] compares the computational costs of the geometric mean, the fractional 
power, and the optimal quantile estimators. The harmonic mean estimator was 
not included as it costs very similarly to the fractional power estimator. 

We used the build-in function "pow" in gec for evaluating the fractional pow- 
ers. We implemented a "quick select" algorithm, which is similar to quick sort 
and requires on average linear time. For simplicity, our implementation used 
recursions and the middle element as pivot. Also, to ensure fairness, for all esti- 
mators, coefficients which are functions of a and/or k were pre-computed. 




Fig. 4. Relative computational cost (d( a ),<?m over d( a ),o 9 , c and d( a ) iSm over d( a )jp), 
from 10 6 simulations at each combination of a and k. The left panel averages 
over all k and the right panel averages over all a. Note that the cost of d( Q ), oq , c 
includes evaluating the ath moment once. 

Normalized by the computing time of d( a ), gm , we observe that relative com- 
putational efficiency does not strongly depend on a. We do observe that the ratio 



9 



of computing time of d( a ) igm over that of d/ a \ oqiC increases consistently with in- 
creasing k. This is because in the definition of d^ a )„ q (and hence also it 
is required to evaluate the fractional power once, which contributes to the total 
computing time more significantly at smaller k. 

Figure [4] illustrates that, (A) the geometric mean estimator and the frac- 
tional power estimator are similar in terms of computational efficiency; (B) the 
optimal quantile estimator is nearly one order of magnitude more computation- 
ally efficient than the geometric mean and fractional power estimators. Because 
we implemented a "naive" version of "quick select" using recursions and simple 
pivoting, the actual improvement may be more significant. Also, if applications 
require only d 1 /?, then no fractional power operations are needed for d^ >oq>c and 
the improvement will be even more considerable. 

3.4 Error (Tail) Bounds 

Error (tail) bounds are essential for determining k. The variance alone is not 
sufficient for that purpose. If an estimator of d, say d, is normally distributed, d ~ 
N (d, tV), the variance suffices for choosing k because its error (tail) probability 
Pr (jd- d\ > ed^j < 2exp (—k^J is determined by V. In general, a reasonable 
estimator will be asymptotically normal, for small enough e and large enough k. 
For a finite k and a fixed e, however, the normal approximation may be (very) 
poor. This is especially true for the fractional power estimator, d( a )j p . 

Thus, for a good motivation, Lemma [3] provides the error (tail) probability 
bounds of di a \ q for any q, not just the optimal quantile q*. 

Lemma 3. Denote X ~ S(a,d^) and its probability density function by fx{x;a,d^ 
and cumulative function by Fx{x;a, d( a )). Given Xj ~ S(a,d( a )), i.i.d., j — 1 to 
k. Using d^ q in Hty, then 

2 



G 

e 



R.q 

2 



Pr (d(a), g > (1 + e)d(a)) ^ ex P \ k (^R~) ,e> °' ® 

Pr (d (a) , q < (1 - e)d (a) ) < exp (^-k-^—^j , < e < 1, (9) 

-(1 - q) log (2 - 2F R ) - glog(2F J j - 1) + (1 - q) log(l - q) + glogg, (10) 

-(1-q) log (2 - 2F L ) - glog(2F L - 1) + (1 - q) log(l - q) + glog g, (11) 



w = F^iil + l)/2; a, 1) = q-quanUle{\S(a, 1)|}, 



Fr — Fx 
As e ^ 



((1 + e) 1/a W; a, l) , F L = F x ((1 - e) 1/a W; a, l) 



lim Gfl, q = lim Gh,q = g) f/ T L ■ (12) 

f^o+ E ^o+ /| (VF; a, l)W 2 



Proof: See Appendix^ □ 
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The limit in (|12j) as e — > is precisely twice the asymptotic variance factor of 
d( a ),q m ©j consistent with the normality approximation mentioned previously. 
This explains why we express the constants as e 2 /G. (fT2f also indicates that the 
tail bounds achieve the "optimal rate" for this estimator, in the language of large 
deviation theory. 

By the Bonferroni bound, it is easy to determine the sample size k 
Pr(\d (aU -d (a) \ >«Z {Q) ) <2exp(-*^ < */(n a /2) =>■ k > ^ (2 log n - log 5) . 

Lemma 4. Using d{ a ),q with k > ^ (21ogn — log<5), any pairwise l a distance 
among n points can be approximated within a lie factor with probability > 1 — 8. 
It suffices to let G = max{Gjj, 9 , Gh, q }, where Gn, q , Gh, q are defined in Lemma\^ 

The Bonferroni bound can be unnecessarily conservative. It is often reason- 
able to replace <5/(n 2 /2) by 6/T, meaning that except for a l/T fraction of pairs, 
any distance can be approximated within a 1 ± e factor with probability 1 — 5. 

Figure [5] plots the error bound constants for e < 1, for both the recom- 
mended optimal quantile estimator dt a \ oq and the baseline sample median esti- 
mator d( a ),q=o.5- Although we choose di a ) )0q based on the asymptotic variance, 
it turns out d^ a )^ oq also exhibits (much) better tail behaviors (i.e., smaller con- 
stants) than d( a ),(j=o.5j at least in the range of e < 1. 




Consider k = % (log2T — log S) (recall we suggest replacing n 2 /2 by T), with 
S = 0.05, e = 0.5, and T — 10. Because G_r )(J » w5~9 around e = 0.5, we obtain 
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k 120 ~ 215, which is still a relatively large number (although the original 
dimension D might be 10 6 ). If we choose e = 1, then approximately k « 40 ~ 65. 

It is possible k = 120 ~ 215 might be still conservative, for three reasons: 
(A) the tail bounds, although "sharp," are still upper bounds; (B) using G = 
max{G/? i9 » , Gl, 9 * } is conservative because Gr,,q* is usually much smaller than 
Gfl„ 9 *; (C) this type of tail bounds is based on relative error, which may be 
stringent for small (« 0) distances. 

In fact, some earlier studies on normal random projections (i.e., a — 2) [13|14j 
empirically demonstrated that k > 50 appeared sufficient. 

4 Simulations 

We resort to simulations for comparing the finite sample variances of various 
estimators and assessing the more precise error (tail) probabilities. 

One advantage of stable random projections is that we know the (manually 
generated) distributions and the only source of errors is from the random number 
generations. Thus, we can simply rely on simulations to evaluate the estimators 
without using real data. In fact, after projections, the projected data follow 
exactly the stable distribution, regardless of the original real data distribution. 

Without loss of generality, we simulate samples from S(a, 1) and estimate the 
scale parameter (i.e., 1) from the samples. Repeating the procedure 10 7 times, 
we can reliably evaluate the mean square errors (MSE) and tail probabilities. 

4.1 Mean Square Errors (MSE) 

As illustrated in Figure [6l in terms of the MSE, the optimal quantile estimator 
d( a ),oq,c outperforms both the geometric mean and fractional power estimators 
when a > 1 and k > 20. The fractional power estimator does not appear to be 
very suitable for a > 1, especially for a close to 2, even when the sample size k is 
not too small (e.g., k = 50). For a < 1, however, the fractional power estimator 
has good performance in terms of MSE, even for small k. 

4.2 Error(Tail) Probabilities 

Figure [Jj presents the simulated right tail probabilities, Pr (d( Q ) > (l + e)d( a )j, 
illustrating that when a > 1, the fractional power estimator can exhibit very 
bad tail behaviors. For a < 1, the fractional power estimator demonstrates good 
performance at least for the probability range in the simulations. 

Thus, Figure [7] demonstrates that the optimal quantile estimator consistently 
outperforms the fractional power and the geometric mean estimators when a > 1 . 

5 The Related Work 

There have been many studies of normal random projections in machine learning, 
for dimension reduction in the li norm, e.g., |14j . highlighted by the Johnson- 
Lindenstrauss (JL) Lemma |15| . which says k = O (logn/e 2 ) suffices when using 
normal (or normal-like, e.g., [TB]) projection methods. 

The method of stable random projections is applicable for computing the 
l a distances (0 < a < 2), not just for l 2 - [U Lemma 1, Lemma 2, Theorem 3] 
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Fig. 6. Empirical mean square errors (MSE, the lower the better), from 10 7 
simulations at every combination of a and k. The values are multiplied by k 
so that four plots can be at about the same scale. The MSE for the geometric 
mean (gm) estimator is computed exactly since closed-form expression exists. 
The lower dashed curves are the asymptotic variances of the optimal quantile 
(oq) estimator. 



suggested the median (i.e., q — 0.5 quantile) estimator for a = 1 and argued that 
the sample complexity bound should be O (l/e 2 ) (n = 1 in their study). Their 
bound was not provided in an explicit form and required an "e is small enough" 
argument. For a^l, 1, Lemma 4] only provided a conceptual algorithm, which 
"is not uniform." In this study, we prove the bounds for any g-quantile and any 
< a < 2 (not just a = 1), in explicit exponential forms, with no unknown 
constants and no restriction that "e is small enough." 

The quantile estimator for stable distributions was proposed in statistics 
quite some time ago, e.g., |17|18j . [17] mainly focused on 1 < a < 2 and rec- 
ommended using q = 0.44 quantiles (mainly for the sake of smaller bias). [18) 
focused on 0.6 < a < 2 and recommended q = 0.5 quantiles. 

This study considers all < a < 2 and recommends q based on the minimum 
asymptotic variance. Because the bias can be easily removed (at least in the 
practical sense) , it appears not necessary to use other quantiles only for the sake 
of smaller bias. Tail bounds, which are useful for choosing q and k based on 
confidence intervals, were not available in |17|18j . 

Finally, one might ask if there might be better estimators. For a = 1, [19] pro- 
posed using a linear combination of quantiles (with carefully chosen coefficients) 
to obtain an asymptotically optimal estimator for the Cauchy scale parameter. 
While it is possible to extend their result to general < a < 2 (requiring some 
non-trivial work), whether or not it will be practically better than the optimal 
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Fig. 7. The right tail probabilities (the lower the better), from 10 7 simulations 
at each combination of a and k. 

quantile estimator is unclear because the extreme quantiles severely affect the 
tail probabilities and finite-sample variances and hence some kind of truncation 
(i.e., discarding some samples at extreme quantiles) is necessary. Also, exponen- 
tial tail bounds of the linear combination of quantiles may not exist or may not 
be feasible to derive. In addition, the optimal quantile estimator is computation- 
ally more efficient. 

6 Conclusion 

Many machine learning algorithms operate on the training data only through 
pairwise distances. Computing, storing, updating and retrieving the "matrix" of 
pairwise distances is challenging in applications involving massive, high-dimensional, 
and possibly streaming, data. For example, the pairwise distance matrix can not 
fit in memory when the number of observations exceeds 10 6 (or even 10 5 ). 

The method of stable random projections provides an efficient mechanism for 
computing pairwise distances using low memory, by transforming the original 
high-dimensional data into sketches, i.e., a small number of samples from in- 
stable distributions, which are much easier to store and retrieve. 

This method provides a uniform scheme for computing the l a pairwise dis- 
tances for all < a < 2. Choosing an appropriate a is often critical to the 
performance of learning algorithms. In principle, we can tune algorithms for 
many l a distances; and stable random projections can provide an efficient tool. 

To recover the original distances, we face an estimation task. Compared with 
previous estimators based on the geometric mean, the harmonic mean, or the 
fractional power, the proposed optimal quantile estimator exhibits two advan- 
tages. Firstly, the optimal quantile estimator is nearly one order of magnitude 
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more efficient than other estimators (e.g., reducing the training time from one 
week to one day) . Secondly, the optimal quantile estimator is considerably more 
accurate when a > 1, in terms of both the variances and error (tail) proba- 
bilities. Note that a > 1 corresponds to a convex norm (satisfying the triangle 
inequality), which might be another motivation for using l a distances with a > 1. 

One theoretical contribution is the explicit tail bounds for general quantile 
estimators and consequently the sample complexity bound k — O (logn/e 2 ). 
Those bounds may guide practitioners in choosing fc, the number of projections. 
We should mention that 

While we believe that applying stable random projections in machine learning 
has become straightforward, there are interesting theoretical issues for future 
research. For example, how theoretical properties of learning algorithms may be 
affected if the approximated (instead of exact) l a distances are used? 

A Proof of Lemma [1] 

Denote fx (x;a, rf( Q )) and Fx (x;a,d^) the probability density function and 
the cumulative density function of X ~ S(a, e?( Q )), respectively. Similarly we use 
fz (z; a, <i( Q )) and Fz (z;a,rf( Q )) for Z = \X\. Due to symmetry, the following 
relations hold 

fz (z; a, d( a) ) = 2f x (z;a,d (a) ) = 2/d^fx (z / 'd]^ ;a,l \ , 
F z (z; a, d (oi) ) = 2F X (z; a, d (a) ) - 1 = 2F X (z/d^y, a, l) - 1, 

F z 1 (q; a, d (a) ) = F^ 1 ((q + l)/2; a, d (a) ) = d\§ F^ 1 ((q + l)/2; a, 1) . 
Let W = g-Quantile{|5(a, 1)|} = F^ 1 ((q + l)/2; a, 1) and W d = F^ 1 (q\ a, d {a) ) 
d]{?W. Then, following known statistical results, e.g., [HI Theorem 9.2], the 
asymptotic variance of da{q should be 

Var (fr/A - 1 g ~ <? | n ( M - 1 q - q2 n ( M 

yax kP z (W d ;a,d (a) )W* +U \k*) kd^/« f 2 iW . aA)W 2 +U {k*) 

_ 1 Q-Q 2 

k4dl a ] /a p x (W;a,l)W* 

By "delta method," i.e., Vax(h(x)) « Var(:r) (h'{x)) 2 , 

B Proof of Lemma [2] 

First, consider a = 1. In this case, 

1, 1) = W = F^ ({q + l)/2; 1, 1) = tan , 

7T + 1 V 2 / 
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It suffices to study L(g) = log g(q; 1). 



T i, 1 1 2tt cos(Trg) „ _ 1 2tt 2 

W « 1-4 sm(Trg) ' WJ g 2 (1 - g) 2 siring) ' 

Because sinfx) < x for x > 0, it is easy to see that . 7 — - > 0, and 
^f-^ - -r- = — vr - -r- > 0. Thus, L" > 0, i.e., L(q) is convex and 
so is 3(4; 1) = e L ^. Since 1/(1/2) = 0, we know g*(l) = 0.5. 

Next we consider a = 0+, using a fact [2] that as a — > 0+, |5(a, 1)| Q con- 
verges to 1/Ei, where £1 stands for an exponential distribution with mean 1. 
Denote h = d(o+) and Zj ~ h/E\. The sample quantile estimator becomes 

- = g-Quantiled^l, j = 1,2, ...,fc} 
(0+) ' q g-Quantile{l/£i} 

In this case, , , 

fz(z;h) = e- h/z -z, F z\vh) = -^- q , 

It is straightforward to show that }~% is a convex function of a and the 

° q log z g ^ 

minimum is attained by solving — log q* + 2q* — 2 = 0, i.e., q* = 0.203. 



C Proof of Lemma [3] 

Given k i.i.d. samples, Xj ~ ^(a, d( a )), j = 1 to k. Let 2^ = |it?j | , j = 1 to fc. 
Denote by -Fz(i; a, d( a )) the cumulative density of Zj, and by Fz t k(t; a, dt a \) the 
empirical cumulative density of Zj, j = 1 to k. 

It is the basic fact[T2"] about order statistics that kFz,k{t) dr a -\) follows a 
binomial, i.e., kFz,k(t; a, dt a \) ~ Bin(k, Fz(t; a, d( Q ))). For simplicity, we replace 
F z {t;a,d {a) ) by F(t,d), F z ,k(t; at, d( a )) by F k (t,d), and d( a ) by rf, in this proof. 

Using the original binomial Chernoff bounds |20j , we obtain, for e' > 0, 

Pr (kF k (t-d) > (1 + e')kF{t; d)) 

f k-kF(t-d) \ *->>a+c')F<.t;d) / feF (i;d) \< nV)*f (*!*) 
~ V* - (1 + e')kF(t; d)J \(1 + e')kF(t;d) J 

~ U - (l + e')F(t;d)J U + e'J _ ' 

and for < e' < 1, 

Pr (kF k (t;d) < (1 - e')kF(t;d)) 
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Consider the general quantile estimator d( a ), q defined in (J4j) - For e > 0, 
(again, denote W = g-quantileHS^a, 1)|}), 

Pr (d ia ), t > (1 + = Pr(g-quantile{|xj|}) > ((1 + t)d) 1/a W) 
=Pr (kF k ((1 + t) 1/a W; l) < <?fc) = Pr (kF k (t- 1) < (1 - e')fcF(t: 1)) , 

where t = (1 + e) 1/Q W and g = (1 - e')F(t; 1). Thus 

Pr (d (ahq > + 



1 - F (((1 + e)) 1 /" W; l) \ 1 " / F (((1 + e)) 1 /" W; l) 



1-9 



exp I — k 



where 



Gr, ( 



-(l-g)log(l-F((l + 6) 1 / a W;l)) 



glog (F ((1 + e) 1/a W; l)) + (1 - g) log(l - q) + glog(g). 



For < e < 1, 



Pr (4 (a) ,, < (1 - e)d) = Pr (fcF fe ((1 - e) 1/a W; l) > gfc) = Pr (kF k (t; 1) > (1 + 1)) , 



where t = (1 - e) 1/a W and g = (1 + e')-F(«; 1). Thus 

Pr(d (Q) ,, < (l-e)d) 

' 1 - F ((1 - e) 1/Q W; l) \ 1_9 / F ((1 - e) 1 ^ W; l) 



1 - 9 



■ exp [ — k 



where 



Gl,, 



= -(l-q)log(l-F[(l-e) 1/a W;l)) 



- glog (F ((1 - e) 1/a W;l))+(1- q) log(l - q) + glog(g). 
Denote f(t;d) — F'(t;d). Using L'Hospital's rule 

-(1 - q) log (l - F ((1 + e) 1 /" IV; l)) 



Um — lim - 

*->o+G Rtq ^-o+ 



— lim 



-q log (F ((1 + e) 1 /" W; l) ) + (1 - q) log(l - q-) + q \og(q) 
7 2 

/ ((1 + e) 1 /" W- l) 2f. (1 + c) 1 /^-! F ((1 + e) 1 ^ W; l) - <? 



— lim 



0+ F ((1 + e) 1 /" VK; l) (l - F ((1 + e) 1 /" W: l) ) 
(/({l + «) I/o Wil)f + 



0+ 2F ((1 + e) 1 /" VK; l) (l - F ((1 + e) 1 /" VK; l)) 
/ 2 (W; 1) W 2 



2q(l - q)a 2 

Similarly 



(<? — F(W, 1)). 



2g(l-g)a 2 



To complete the proof, apply the relations on Z = \X\ in the proof of Lemma [TJ 
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